Using SQL and Plotly to Perform Analysis on Temperature Data

HW 1
Author

Alvin

Part 1: Constructing a SQL Database

In this blog, we’ll be using SQL and plotly to construct aesthetic graphs. The sqlite3 package allows for creating and accessing SQL databases in Python.

First, some basic imports:

import pandas as pd
import sqlite3

You should already be fairly familiar with working with Pandas dataframes. To work with SQL, first we need to establish a connection and create a cursor that can execute queries for us.

conn = sqlite3.connect("data.db")
cursor = conn.cursor()

For this post we’ll be using temperature data collected from various temperature reading stations around the world. The to_sql() function imports a given dataframe into our SQL database.

def to_db(url,name):
    '''
    Reads given .csv file into the established SQL database with given name
    '''
    df = pd.read_csv(url)
    df.to_sql(name,conn,index=False,if_exists="replace")
    
to_db("temps_stacked.csv","temperatures")
to_db("countries.csv","countries")
to_db("station-metadata.csv","stations")

Now we can check that we have three different tables in our database:

cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
cursor.fetchall()
[('temperatures',), ('countries',), ('stations',)]

Part 2: Extracting Data

Our end goal is to create a graph that addresses the following question:

How does the average yearly change in temperature vary within a given country?

To this end, we need a function to extract the specific data that we want from our database. The key here is knowing how to construct the appropriate SQL command. A tutorial on basic SQL syntax can be found here: https://www.w3schools.com/sql/default.asp

For our function, we want to be able to specify a year range as well as a month, the latter so that temperature variations as seasons change don’t affect the data.

def query_climate_database(country,year_begin,year_end,month):
    '''
    Returns Pandas dataframe of data from SQL database
    where station is located in country and reading is
    from month and in range year_begin to year_end
    '''
    
    #extract FIPS 10-4 from country
    cursor.execute(f"SELECT [FIPS 10-4] FROM countries WHERE Name='{country}' LIMIT 1")
    fips = cursor.fetchone()[0]
    
    #query SQL database for data
    cmd = f"SELECT S.name,S.latitude,S.longitude,C.Name,T.temp,T.year,T.month \
    FROM temperatures T \
    LEFT JOIN stations S \
    ON T.id = S.id \
    LEFT JOIN countries C \
    ON C.[FIPS 10-4] = SUBSTRING(T.id,1,2) \
    WHERE (year >= {year_begin}) \
    AND (year <= {year_end}) \
    AND (month == {month}) \
    AND (SUBSTRING(T.id,1,2) == '{fips}')"
    
    return pd.read_sql_query(cmd,conn)

Here’s an example of what our returned dataframe looks like:

df=query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
df
NAME LATITUDE LONGITUDE Name Temp Year Month
0 PBO_ANANTAPUR 14.583 77.633 India 23.48 1980 1
1 PBO_ANANTAPUR 14.583 77.633 India 24.57 1981 1
2 PBO_ANANTAPUR 14.583 77.633 India 24.19 1982 1
3 PBO_ANANTAPUR 14.583 77.633 India 23.51 1983 1
4 PBO_ANANTAPUR 14.583 77.633 India 24.81 1984 1
... ... ... ... ... ... ... ...
3147 DARJEELING 27.050 88.270 India 5.10 1983 1
3148 DARJEELING 27.050 88.270 India 6.90 1986 1
3149 DARJEELING 27.050 88.270 India 8.10 1994 1
3150 DARJEELING 27.050 88.270 India 5.60 1995 1
3151 DARJEELING 27.050 88.270 India 5.70 1997 1

3152 rows × 7 columns

Part 3: Plotting

Now we’re ready to tackle our question. To calculate the average yearly change in temperature, we’ll be creating a linear regression of temperature vs. year and extracting the linear coefficient. For this to be meaningful, our function will be taking in a min_obs parameter for the minimum amount of observations required for a station to be considered. If a station doesn’t have enough observations, we remove it from the dataframe before plotting.

Before we implement this in a function, let’s work through it with our India data. We can use the groupby() function along with the transform() function which, in contrast to apply(), retains the indexing of the original dataframe.

#removes observations from stations with less than 10 observations
df[df.groupby("NAME")["Year"].transform(lambda x:x.count())>=10]
NAME LATITUDE LONGITUDE Name Temp Year Month
0 PBO_ANANTAPUR 14.583 77.633 India 23.48 1980 1
1 PBO_ANANTAPUR 14.583 77.633 India 24.57 1981 1
2 PBO_ANANTAPUR 14.583 77.633 India 24.19 1982 1
3 PBO_ANANTAPUR 14.583 77.633 India 23.51 1983 1
4 PBO_ANANTAPUR 14.583 77.633 India 24.81 1984 1
... ... ... ... ... ... ... ...
3140 SHILONG 25.600 91.890 India 10.40 1986 1
3141 SHILONG 25.600 91.890 India 11.20 1990 1
3142 SHILONG 25.600 91.890 India 11.99 2010 1
3143 SHILONG 25.600 91.890 India 9.93 2011 1
3144 SHILONG 25.600 91.890 India 9.68 2012 1

3106 rows × 7 columns

Notice that we’ve removed data from the DARJEELING station, which only had 7 observations.

Next, we import from sklearn, which you should already be familiar with, and write a function that finds the average yearly temperature increase using a linear regression model. We can then use this to add a column to the dataframe.

from sklearn.linear_model import LinearRegression

def lin_coeff(data):
    '''
    Returns linear coefficient of year vs. temp regression fitted to data
    '''
    reg = LinearRegression().fit(data[["Year"]],data["Temp"])
    return round(reg.coef_[0],4)

data=df.groupby(["NAME","LATITUDE","LONGITUDE"]).apply(lin_coeff).reset_index()
data
NAME LATITUDE LONGITUDE 0
0 AGARTALA 23.8830 91.2500 -0.0062
1 AGRA 27.1667 78.0333 -0.0954
2 AHMADABAD 23.0670 72.6330 0.0067
3 AKOLA 20.7000 77.0330 -0.0018
4 AKOLA 20.7000 77.0670 -0.0059
... ... ... ... ...
101 TRIVANDRUM 8.5000 77.0000 0.0229
102 UDAIPUR_DABOK 24.6170 73.8830 0.0724
103 VARANASI_BABATPUR 25.4500 82.8670 -0.0130
104 VERAVAL 20.9000 70.3670 0.0248
105 VISHAKHAPATNAM 17.7170 83.2330 -0.0340

106 rows × 4 columns

Things are looking pretty good! It’s not ideal that our column title is just “0”, but at this point we’re ready to write our plotting function using what we’ve just figured out. We’ll be relying on the plotly.express module to do so.

from plotly import express as px
import plotly.io as pio 

pio.renderers.default= 'iframe'
month_dict = {1 : "January",
              2 : "February",
              3 : "March",
              4 : "April",
              5 : "May",
              6 : "June",
              7 : "July",
              8 : "August",
              9 : "September",
              10 : "October",
              11 : "November",
              12 : "December"}

def temperature_coefficient_plot(country,year_begin,year_end,month,min_obs,**kwargs):
    '''
    Plots average yearly increase in temperature at stations in given country
    
    Parameters:
    country - country from which station data is taken
    year_begin - earliest year from which data is taken
    year_end - latest year from which data is taken
    month - month from which data is taken (prevents seasonal changes in temperature
                                            from influencing data)
    min_obs - minimum number of observations required per station
    '''
    df = query_climate_database(country,year_begin,year_end,month)
    
    #remove station data without enough observations
    df = df[df.groupby("NAME")["Year"].transform(lambda x:x.count()) >= min_obs]
    
    #calculate average yearly increase at each station
    df = df.groupby(["NAME","LATITUDE","LONGITUDE"]).apply(lin_coeff).reset_index()
    df = df.rename(columns={0:"Estimated Yearly Increase (\u00b0C)"})
    
    #creates scatterplot with color corresponding to change in temperature
    fig = px.scatter_mapbox(df,
                            lat = "LATITUDE",
                            lon = "LONGITUDE",
                            hover_name = "NAME",
                            color = "Estimated Yearly Increase (\u00b0C)",
                            color_continuous_midpoint = 0,
                            title = f"Average yearly increase in temperature in {month_dict[month]}<br>for stations in {country} from {year_begin} - {year_end}",
                            **kwargs)
    return fig

Here are two examples of what the graph looks like:

#set color scale for points
color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)
fig.show()
fig = temperature_coefficient_plot("China", 1975, 2015, 8, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)
fig.show()

Part 4: More Fun Stuff

Now that we’re more familiar with SQL and plotly, let’s ask some more questions we can use our tools to visualize. For example,

How does longitude affect how greatly temperature varies across time?

The process is more or less the same as before: we first define a function that uses a SQL query to return the data that we want, and then we write a plotter function that uses plotly.express.

def query_lon_temp(lat_min,lat_max):
    '''
    Returns dataframe containing longitudinal and temperature data within given latitude range
    '''
    cmd = f"SELECT S.longitude,T.temp,T.year,T.month \
    FROM temperatures T \
    LEFT JOIN stations S \
    ON T.id = S.id \
    WHERE (S.latitude >= {lat_min}) AND (S.latitude <= {lat_max})"
    
    return pd.read_sql_query(cmd,conn)
def scatter_temp_lon_all(lat_min,lat_max):
    '''
    Plots standard deviation of temperature data at each longitude for stations within
    given latitude range
    '''
    df = query_lon_temp(lat_min,lat_max)
    
    #calculate standard deviation of temperature data at each longitude
    df = df.groupby("LONGITUDE").apply(lambda x:x["Temp"].std()).reset_index()
    df = df.rename(columns={0:"Temp StdDev (\u00b0C)"})
    
    fig = px.scatter(data_frame = df, 
                 x = "LONGITUDE", 
                 y = "Temp StdDev (\u00b0C)",
                 title = f"Temperature Variation vs. Longitude at Latitudes {lat_min}\u00b0 to {lat_max}\u00b0",
                 opacity = 0.5)
    return fig
scatter_temp_lon_all(-10,10)

We see the most variation around the tropics, in the middle of each hemisphere, while there is very little variation at the poles. While temperature at the equator varies somewhat, how much it varies stays relatively consistent.

A simpler question we can ask is:

What is the distribution of temperatures at each station?

Obviously there are too many stations to all be reasonably visualized, so our functions take in a list of stations to be plotted.

def query_temps(stations,month):
    '''
    Returns dataframe with data read from given stations at a particular month
    '''
    cmd = f"SELECT S.name,T.temp,T.year \
    FROM temperatures T \
    LEFT JOIN stations S \
    ON T.id = S.id \
    WHERE (S.name IN {tuple(stations)}) AND (T.month == {month})"
    
    return pd.read_sql_query(cmd,conn)
def hist_temps(stations, month):
    '''
    Plots histogram of temperature distribution with stations as facets
    '''
    df = query_temps(stations, month)
    df = df.rename(columns={"Temp":"Temperature Readings (\u00b0C)","NAME":"Station"})
    
    fig = px.histogram(df, 
                       x = "Temperature Readings (\u00b0C)",
                       nbins = 30,
                       facet_row = "Station",
                       title = f"Distribution of Temperature Readings (\u00b0C) in {month_dict[month]}")
    
    return fig

Here’s what it looks like:

stations = ["DUBAI_INTL","HERAT","SAVE"]
hist_temps(stations,3)

We’re pretty much done, but one last thing: don’t forget to close your connection to the database!

conn.close()